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RANDOMIZED ITERATIVE METHODS FOR LINEAR SYSTEMS 


ROBERT M. GOWER t PETER RICHTARIK * 

Abstract. We develop a novel, fundamental and surprisingly simple randomized iterative method 
for solving consistent linear systems. Our method has six different but equivalent interpretations: 
sketch-and-project, constrain-and-approximate, random intersect, random linear solve, random up¬ 
date and random fixed point. By varying its two parameters—a positive definite matrix (defining 
geometry), and a random matrix (sampled in an i.i.d. fashion in each iteration)—we recover a com¬ 
prehensive array of well known algorithms as special cases, including the randomized Kaczmarz 
method, randomized Newton method, randomized coordinate descent method and random Gaussian 
pursuit. We naturally also obtain variants of all these methods using blocks and importance sam¬ 
pling. However, our method allows for a much wider selection of these two parameters, which leads 
to a number of new specific methods. We prove exponential convergence of the expected norm of the 
error in a single theorem, from which existing complexity results for known variants can be obtained. 
However, we also give an exact formula for the evolution of the expected iterates, which allows us to 
give lower bounds on the convergence rate. 
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1. Introduction. The need to solve linear systems of equations is ubiquitous 
in essentially all quantitative areas of human endeavour, including industry and sci¬ 
ence. Linear systems are a central problem in numerical linear algebra, and play an 
important role in computer science, mathematical computing, optimization, signal 
processing, engineering, numerical analysis, computer vision, machine learning, and 
many other fields. 

For instance, in the field of large scale optimization, there is a growing interest in 
inexact and approximate Newton-type methods for [7, 11, 1, 40, 39, 13], which can 
benefit from fast subroutines for calculating approximate solutions of linear systems. 
In machine learning, applications arise for the problem of finding optimal configura¬ 
tions in Gaussian Markov Random Fields [32] , in graph-based semi-supervised learning 
and other graph-Laplacian problems [2], least-squares SVMs, Gaussian processes and 
more. 

In a large scale setting, direct methods are generally not competitive when com¬ 
pared to iterative approaches. While classical iterative methods are deterministic, 
recent breakthroughs suggest that randomization can play a powerful role in the de¬ 
sign and analysis of efficient algorithms [38, 19, 22, 9, 41, 18, 21, 29] which are in 
many situations competitive or better than existing deterministic methods. 

1.1. Contributions. Given a real matrix A £ R mxn and a real vector b £ R m , 
in this paper we consider the linear system 

(1.1) Ax = b. 

We shall assume throughout that the system is consistent: there exists x* for which 
Ax* = b. 
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We now comment on the main contribution of this work: 

1. New method. We develop a novel, fundamental, and surprisingly simple 
randomized iterative method for solving (1.1). 

2. Six equivalent formulations. Our method allows for several seemingly 
different but nevertheless equivalent formulations. First, it can be seen as a sketch- 
and-project method, in which the system (1.1) is replaced by its random sketch , and 
then the current iterate is projected onto the solution space of the sketched system. 
We can also view it as a constrain-and-approximate method, where we constrain the 
next iterate to live in a particular random affine space passing through the current 
iterate, and then pick the point from this subspace which best approximates the op¬ 
timal solution. Third, the method can be seen as an iterative solution of a sequence 
of random (and simpler) linear equations. The method also allows for a simple geo¬ 
metrical interpretation: the new iterate is defined as the unique intersection of two 
random affine spaces which are orthogonal complements. The fifth viewpoint gives a 
closed form formula for the random update which needs to be applied to the current 
iterate in order to arrive at the new one. Finally, the method can be seen as a random 
fixed point iteration. 

3. Special cases. These multiple viewpoints enrich our interpretation of the 
method, and enable us to draw previously unknown links between several existing 
algorithms. Our algorithm has two parameters, an n x n positive definite matrix B 
defining geometry of the space, and a random matrix S. Through combinations of 
these two parameters, in special cases our method recovers several well known algo¬ 
rithms. For instance, we recover the randomized Kaczmarz method of Strohmer and 
Vershyinin [38], randomized coordinate descent method of Leventhal and Lewis [19], 
random pursuit [25, 37, 36, 35] (with exact line search), and the stochastic Newton 
method recently proposed by Qu et al [29]. However, our method is more general, and 
leads to i) various generalizations and improvements of the aforementioned methods 
(e.g., block setup, importance sampling), and ii) completely new methods. Random¬ 
ness enters our framework in a very general form, which allows us to obtain a Gaussian 
Kaczmarz method , Gaussian descent , and more. 

4. Complexity: general results. When A has full column rank, our framework 
allows us to determine the complexity of these methods using a single analysis. Our 
main results are summarized in Table 1, where {a; 1 } are the iterates of our method, 
Z is a random matrix dependent on the data matrix A, parameter matrix B and 
random parameter matrix S, defined as 

( 1 . 2 ) z d = a t s(s t ab~ 1 a t s^s t a, 

where f denotes the (Moore-Penrose) pseudoinverse 1 . Moreover, ||x||b = f \J(x, x) B , 

where ( x,y) B c = x T By , for all x,y £ K”. It can be deduced from the properties of 
the pseudoinverse that Z is necessarily symmetric and positive semidefinite 2 . 

As we shall see later, we will often consider setting B = /, B = A (if A is positive 
definite) or B = A T A (if A is of full column rank). In particular, we first show that 
the convergence rate p is always bounded between zero and one. We also show that 
as soon as E [Z\ is invertible (which can only happen if A has full column rank, which 

1 Every (not necessarily square) real matrix M has a real pseudoinverse. In particular, in this 
paper we will use the following properties of the pseudoinverse: MM'M = M, M'MM* = M, 
(M T A/)tM T = M +, (M t )t = (Mt) T and (MM T ) + = (Aft) T Aft. 

2 Indeed, it suffices to use the identity (MM T )t = (Aft) T A/t w jth M = S T . 
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E [x k+1 - x*] = (I - B- 1 E [Z]) E [x k - x*] 

Theorem 4.1 

IIE [x k+1 - x*] ||| < p 2 • ||E [x k - x*] HI 

Theorem 4.4 

E [||x fc+1 — x* ||] < p • E [ x fc — x* |||] 

Theorem 4.6 


Table 1: Our main complexity results. The convergence rate is: p = 1 — 
A min (B- 1 / 2 E[Z].B- 1 / 2 ). 


then implies that x* is unique), we have p < 1, and the method converges. Besides 
establishing a bound involving the expected norm of the error (see the last line of 
Table 1), we also obtain bounds involving the norm of the expected error (second line 
of Table 1). Studying the expected sequence of iterates directly is very fruitful, as 
it allows us to establish an exact characterization of the evolution of the expected 
iterates (see the first line of Table 1) through a linear fixed point iteration. 

Both of these theorems on the convergence of the error can be recast as iteration 
complexity bounds. For instance, using standard arguments, from Theorem 4.4 in 
Table 1 we observe that for a given e > 0 we have that 

(1.3) ^ “ T^p l0g (e) ^ ll E [x k - x*] || B < e||x° - x*\\ B . 

5. Complexity: special cases. Besides these generic results, which hold 
without any major restriction on the sampling matrix S (in particular, it can be either 
discrete or continuous), we give a specialized result applicable to discrete sampling 
matrices S (see Theorem 5.2). In the special cases for which rates are known, our 
analysis recovers the existing rates. 

6. Extensions. Our approach opens up many avenues for further development 
and research. For instance, it is possible to extend the results to the case when A is 
not necessarily of full column rank. Furthermore, as our results hold for a wide range 
of distributions, new and efficient variants of the general method can be designed for 
problems of specific structure by fine-tuning the stochasticity to the structure. Similar 
ideas can be applied to design randomized iterative algorithms for finding the inverse 
of a very large matrix. 

1.2. Background and Related Work. The literature on solving linear systems 
via iterative methods is vast and has long history [1 7, 33]. For instance, the Kaczmarz 
method, in which one cycles through the rows of the system and each iteration is 
formed by projecting the current point to the hyperplane formed by the active row, 
dates back to the 30’s [16]. The Kaczmarz method is just one example of an array 
of row-action methods for linear systems (and also, more generally, feasibility and 
optimization problems) which were studied in the second half of the 20th century [4] . 

Research into the Kaczmarz method was in 2009 reignited by Strohmer and Ver- 
shynin [38], who gave a brief and elegant proof that a randomized thereof enjoys an 
exponential error decay (also know as “linear convergence”). This has triggered much 
research into developing and analyzing randomized linear solvers. 

It should be mentioned at this point that the randomized Kaczmarz (RK) method 
arises as a special case (when one considers quadratic objective functions) of the 
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stochastic gradient descent (SGD) method for convex optimization which can be traced 
back to the seminal work of Robbins and Monro’s on stochastic approximation [31]. 
Subsequently, intensive research went into studying various extensions of the SGD 
method. However, to the best of our knowledge, no complexity results with exponen¬ 
tial error decay were established prior to the aforementioned work of Strohmer and 
Vershynin [38]. This is the reason behind our choice of [38] as the starting point of 
our discussion. 

Motivated by the results of Strohmer and Vershynin [38], Leventlral and Lewis [19] 
utilize similar techniques to establish the first bounds for randomized coordinate de¬ 
scent methods for solving systems with positive definite matrices, and systems aris¬ 
ing from least squares problems [1 ]. These bounds are similar to those for the RK 
method. This development was later picked up by the optimization and machine 
learning communities, and much progress has been made in generalizing these early 
results in countless ways to various structured convex optimization problems. For a 
brief up to date account of the development in this area, we refer the reader to [12, 
28] and the references therein. 

The RK method and its analysis have been further extended to the least-squares 
problem [22, 41] and the block setting [23, 24]. In [21] the authors extend the ran¬ 
domized coordinate descent and the RK methods to the problem of solving under¬ 
determined systems. The authors of [21, 30] analyze side-by-side the randomized 
coordinate descent and RK method, for least-squares, using a convenient notation in 
order to point out their similarities. Our work takes the next step, by analyzing these, 
and many other methods, through a genuinely single analysis. Also in the spirit of 
unifying the analysis of different methods, in [26] the authors provide a unified analysis 
of iterative Schwarz methods and Kaczmarz methods. 

The use of random Gaussian directions as search directions in zero-order (derivative- 
free) minimization algorithm was recently suggested [25]. More recently, Gaussian 
directions have been combined with exact and inexact line-search into a single ran¬ 
dom pursuit framework [35] , and further utilized within a randomized variable metric 
method [36, 37]. 

2. One Algorithm in Six Disguises. Our method has two parameters: i) an 
n x n positive definite matrix B which is used to define the R-inner product and the 
induced R-norm by 

(2-1) (x : y) B = { (Bx,y), ||x|| B = f y/(x,x) B , 


where (•,•) is the standard Euclidean inner product, and ii) a random matrix S £ 
to be drawn in an i.i.d. fashion at each iteration. We stress that we do not 
restrict the number of columns of S; indeed, we even allow q to vary (and hence, q is 
a random variable). 

2.1. Six Viewpoints. Starting from x k £ R ra , our method draws a random 
matrix S and uses it to generate a new point x k+1 £ R n . This iteration can be 
formulated in six seemingly different but equivalent ways: 

1. Sketching Viewpoint: Sketch-and-Project. x k+1 is the nearest point to x k 
which solves a sketched version of the original linear system: 


( 2 . 2 ) 


x k+l = arg min ||x — x k \\ B subject to S T Ax = S T b 


This viewpoint arises very naturally. Indeed, since the original system (1.1) is assumed 
to be complicated, we replace it by a simpler system—a random sketch of the original 
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system (1.1)— whose solution set {x \ S T Ax = S T b} contains all solutions of the 
original system. However, this system will typically have many solutions, so in order 
to define a method, we need a way to select one of them. The idea is to try to 
preserve as much of the information learned so far as possible, as condensed in the 
current point x k . Hence, we pick the solution which is closest to x k . 

2. Optimization Viewpoint: Constrain-and-Approximate. x k+1 is the best ap¬ 
proximation of x* in a random space passing through x k : 


(2.3) 


x k+1 = arg min ||ar — x*\\g 

xGM 71 


subject to x 


x k + B 1 A r Sy, y is free 


The above step has the following interpretation 3 . We choose a random affine space 
containing x k , and constrain our method to choose the next iterate from this space. 
We then do as well as we can on this space; that is, we pick x k+1 as the point which 
best approximates x*. Note that x k+l does not depend on which solution x* is used in 
(2.3) (this can be best seen by considering the geometric viewpoint, discussed next). 

3. Geometric viewpoint: Random Intersect. x k+1 is the (unique) intersection of 
two affine spaces: 


(2.4) 


{cc fc+1 } = (x* + Null (S t A)) p| {x k + Range (B~ 1 A t S)) 


First, note that the first affine space above does not depend on the choice of x* from 
the set of optimal solutions of (1.1). A basic result of linear algebra says that the 
nullspace of an arbitrary matrix is the orthogonal complement of the range space of 
its transpose. Hence, whenever we have h £ Null (S' T A) and where q is the 

number of rows of 5, then ( h 1 A T Sy) = 0. It follows that the two spaces in (2.4) 
are orthogonal complements with respect to the H-inner product and as such, they 
intersect at a unique point (see Figure 2.1). 

4- Algebraic viewpoint: Random Linear Solve. Note that x k+1 is the (unique) 
solution (in x) of a linear system (with variables x and y): 


(2.5) 


x k+1 = solution of S T Ax = S T b, x = x k + B~ 1 A T Sy 


This system is clearly equivalent to (2.4), and can alternatively be written as: 


( 2 . 6 ) 


(S T A 0 \fx\_ (S T b\ 

V B -A T S) \y) \Bx k ) ■ 


Hence, our method reduces the solution of the (complicated) linear system (1.1) into 
a sequence of (hopefully simpler) random systems of the form (2.6). 


3 Formulation (2.3) is similar to the framework often used to describe Krylov methods [20, Chapter 
1], which is 

x k+i ar g m j n ilj; _ ||r subject to x E x° + /Cz c _|_i, 

where /Cfc_|_i C R n is a (k + l)-dimensional subspace. Note that the constraint x E x° + K-k+i is an 
affine space that contains x°, as opposed to x k in our formulation (2.3). The objective ||a; — is a 
generalization of the residual, where B = A T A is used to characterize minimal residual methods [27, 
34] and B = A is used to describe the Conjugate Gradients method [15]. Progress from iteration to 
the next is guaranteed by using expanding nested search spaces at each iteration, that is, /C^ C /Cfc+i. 
In our setting, progress is enforced by using x k as the displacement term instead of x° . This also 
allows for a simple recurrence for updating x k to arrive at x k + 1 , which facilitates the analyses of 
the method. In the Krylov setting, to arrive at an explicit recurrence, one needs to carefully select 
a basis for the nested spaces that allows for short recurrence. 






6 



Fig. 2.1: The geometry of our algorithm. The next iterate, x h+1 . arises as the intersection of two 
random affine spaces: x k + Range f//~ 1 /1 7 ,Sj and x* + Null ( S T A ) (see (2.4)). The spaces are 
orthogonal complements of each other with respect to the /Tinner product, and hence x k+1 can 
equivalently be written as the projection, in the //-norm, of x k onto x * + Null (S' r A) (see (2.2)), 
or the projection of x* onto x k + Range (B _1 A T S) (see (2.3)). The intersection x k+1 can also be 
expressed as the solution of a system of linear equations (see (2.5)). Finally, the new error x k ~ 1 — x* 
is the projection, with respect to the //-inner product, of the current error x k — x* onto Null (,S' 7 '.4). 
This gives rise to a random fixed point formulation (see (2.8)). 


5. Algebraic viewpoint: Random Update. By plugging the second equation in (2.5) 
into the first, we eliminate x and obtain the system (S T AB~ X A T S)y = S T (b— Ax k ). 
Note that for all solutions y of this system we must have x k+1 = x k + B~ 1 A T Sy. 
In particular, we can choose the solution y = y k of minimal Euclidean norm, which 
is given by y k = (S T AB _1 A T S)^S T (b — Ax k ), where 1 denotes the Moore-Penrose 
pseudoinverse. This leads to an expression for x k+1 with an explicit form of the 
random update which must be applied to x k in order to obtain x k+1 \ 


(2.7) 


x k+1 =x k - B- 1 A T S(S T AB- 1 A T S)' i S T (Ax k - b ) 


In some sense, this form is the standard: it is customary for iterative techniques to 
be written in the form x k+1 = x k + d k , which is precisely what (2.7) does. 

6. Analytic viewpoint: Random Fixed Point. Note that iteration (2.7) can be 
written as 


( 2 . 8 ) 


x k+1 -x* = (I-B~ 1 Z){ x k -x*) 


where Z is defined in (1.2) and where we used the fact that Ax* = b. Matrix Z plays a 
central role in our analysis, and can be used to construct explicit projection matrices 
of the two projections depicted in Figure 2.1. 

The equivalence between these six viewpoints is formally captured in the next 
statement. 

Theorem 2.1 (Equivalence). The six viewpoints are equivalent: they all produce 
the same (unique) point x k+1 . 

Proof. The proof is simple, and follows directly from the above discussion. In 
particular, see the caption of Figure 2.1. □ 
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2.2. Projection Matrices. In this section we state a few key properties of 
matrix Z. This will shed light on the previous discussion and will also be useful later 
in the convergence proofs. 

Recall that S is a m x q random matrix (with q possibly being random), and that 
A is an m x n matrix. Let us define the random quantity 

(2.9) d d = Rank (, S T A) 
and notice that d < minjg, n}, 

(2.10) dim (Range (B~ 1 A t S)) = d, and dim (Null (S' r A)) = n — d. 

Lemma 2.2. With respect to the geometry induced by the B-inner product, we 
have that 

(i) B~ 1 Z projects orthogonally onto the d-dimensional subspace Range ( B~ 1 A T S ) 
(ii) ( I—B~ 1 Z ) projects orthogonally onto (n—d) -dimensional subspace Null [S T A) . 
Proof. For any matrix M, the pseudoinverse satisfies the identity M' MM' = Mb 
Using this with M = S T AB~ l A T S , we get 

(B-'Z) 2 ( = 2) B- 1 A t S(S t AB- 1 A t 5) f S T AB~ X A T S{S T AB~ X A T 5) f S T A 

(2.11) = B- 1 A T S(S T AB~ 1 A T S) f S T A ( = 2) B~ X Z, 

and thus both B~ X Z and I — B~ 1 Z are projection matrices. In order to establish that 
B _1 Z is an orthogonal projection with respect to the R-inner product (from which 
it follows that / — B~ X Z is), we will show that 

B~ 1 Z(B~ 1 A t S) = B~ 1 A t S , and B^Zy = 0, \/y G Null (, S T A) . 

The second relation is trivially satisfied. In order to establish the first relation, it is 
enough to use two further properties of the pseudoinverse: (M T M)t M T = M f and 
MM^M = M, both with M = B~ 1 / 2 A T S. Indeed, 

B- 1 Z(B~ 1 A t S) (1 = 2) R- 1 / 2 M(M T M) t M T M 
= R“ 1/2 MM t M 
= B~ 1/2 M = B~ 1 A t S. 


□ 

This lemma sheds additional light on Figure 2.1 as it gives explicit expressions 
for the associated projection matrices. The result also implies that I — B~ X Z is a 
contraction with respect to the R-norm, which means that the random fixed point 
iteration (2.8) has only very little room not to work. While I — B~ X Z is not a strict 
contraction, under some reasonably weak assumptions on S it will be a strict contrac¬ 
tion in expectation, which ensures convergence. We shall state these assumptions and 
develop the associated convergence theory for our method in Section 4 and Section 5. 

3. Special Cases: Examples. In this section we briefly mention how by select¬ 
ing the parameters S and B of our method we recover several existing methods. The 
list is by no means comprehensive and merely serves the purpose of an illustration of 
the flexibility of our algorithm. All the associated complexity results we present in 
this section, can be recovered from Theorem 5.2, presented later in Section 5. 


3.1. The One Step Method. When S is an m x m invertible matrix with 
probability one, then the system S T Ax = S T b is equivalent to solving Ax = b , thus 
the solution to (2.2) must be x k+1 = x*, independently of matrix B. Our convergence 
theorems also predict this one step behaviour, since p = 0 (see Table 1). 

3.2. Random Vector Sketch. When S = s £ R m is restricted to being a 
random column vector, then from (2.7) a step of our method is given by 


(3.1) 

if A T s ^ 0 and x k+1 = 
a £ ffi. is given by 


fc +i fc s T (Ax k - b) , T 
x k+ =x k --—- A-B 1 A i s 


s T AB~ 1 A T s 

x k otherwise. This is because the pseudo inverse of a scalar 


a 


t _ 



if a 7 ^ 0 
if a = 0 . 


Next we describe several well known specializations of the random vector sketch and 
for brevity, we write the updates in the form of (3.1) and leave implicit that when 
the denominator is zero, no step is taken. 

3.3. Randomized Kaczmarz. If we choose S = e l (unit coordinate vector in 
R m ) and B = I (the identity matrix), in view of (2.2) we obtain the method: 

(3.2) x k+1 = arg min ||a; — x k \\\ subject to A i: x = bi. 


Using (2.7), these iterations can be calculated with 


(3.3) 


x k+1 = x k 


Ai,x k - bi 

11 ^: 11 ! 


(4:) T 


Complexity. When i is selected at random, this is the randomized Kaczmarz 
( RK) method [38]. A specific non-uniform probability distribution for S can yield 
simple and easily interpretable (but not necessarily optimal) complexity bound. In 
particular, by selecting i with probability proportional to the magnitude of row i of A, 
that is pi = || 4: ||!/ll 41 f, ^ follows from Theorem 5.2 that RK enjoys the following 
complexity bound: 

(3.4) E [U**-all] <(l — 


Amin {A T A) 
\\ a \\ 2 f 


|x° — x* 


This result was first established by Strohmer and Vershynin [38]. We also provide 
new convergence results in Theorem 4.4, based on the convergence of the norm of the 
expected error. Theorem 4.4 applied to the RK method gives 

(3.5) ||E [x k — x*] ||i < ( 1 — 


Amin 

II4If 


2k 


||*° — X* 


Now the convergence rate appears squared, which is a better rate, though, the expec¬ 
tation has moved inside the norm, which is a weaker form of convergence. 

Analogous results for the convergence of the norm of the expected error holds for 
all the methods we present, though we only illustrate this with the RK method. 
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Re-interpretation as SGD with exact line search. Using the Constrain-and-Approximate 
formulation (2.3), the randomized Kaczmarz method can also be written as 

x k+1 = arg min ||x — ar*||| subject to x = x k + y(A i: ) T , j/gK, 

with probability pi. Writing the least squares function /( x) = \\\Ax — b\\ 2 as 


f(x) = y^Pifiix), 


2 = 1 


fi(x) = (A i: x - bi) 2 , 

%Pi 


we see that the random vector V/i(x) = A(Ai,x — bi)(Ai : ) T is an unbiased estimator 
of the gradient of / at x. That is, E[V/j(x)] = V/(x). Notice that RK takes 
a step in the direction — V/;(x). This is true even when A^ : x — bi = 0, in which 
case, the RK does not take any step. Hence, RK takes a step in the direction of 
the negative stochastic gradient. This means that it is equivalent to the Stochastic 
Gradient Descent (SGD) method. However, the stepsize choice is very special: RK 
chooses the stepsize which leads to the point which is closest to x* in the Euclidean 
norm. 


3.4. Randomized Coordinate Descent: positive definite case. If A is 

positive definite, then we can choose B = A and S = e l in (2.2), which results in 

(3.6) x k+1 = f arg min ||x — x k \\\ subject to (A i: ) T x = bi, 

x£M. n 


where we used the symmetry of A to get (e*) T A = A;. = (A : i) T . The solution to the 
above, given by (2.7), is 


(3.7) 


x k+1 =x k - 


(A i: ) 


T„k 


- b 


Ai 


V 


Complexity. When i is chosen randomly, this is the Randomized CD method 
(CD-pd). Applying Theorem 5.2, we see the probability distribution pi = A„ : /Tr (A) 
results in a convergence with 

(3.8) E [||x fc — x*\\\] < 

This result was first established by Leventhal and Lewis [19]. 

Interpretation. Using the Constrain-and-Approximate formulation (2.3), this method 
can be interpreted as 

(3.9) x k+1 = argmin ||x — x*\\ 2 A subject to x = x k + ye l , yGM., 

with probability It is easy to check that the function f(x) = \x T Ax — b T x satisfies: 

||x — x*||^ = 2 f(x) + b T x*. Therefore, (3.9) is equivalent to 

(3.10) x k+1 = arg min/(x) subject to x = x k + ye 1 , y S R. 


The iterates (3.7) can also be written as 

x k+1 =x k - -^V t /(x fe )e i , 

■L'i 

where Li = An is the Lipschitz constant of the gradient of / corresponding to coor¬ 
dinate i and V,/(x fe ) is the ith partial derivative of / at x k . 
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3.5. Randomized Block Kaczmarz. Our framework also extends to new 
block formulations of the randomized Kaczmarz method. Let R be a random subset 
of [m] and let S = I :R be a column concatenation of the columns of the mxm identity 
matrix I indexed by R. Further, let B = I. Then (2.2) specializes to 


x k+1 = arg min ||x — s fc ||f subject to A R: x = b R . 

zeR™ 

In view of (2.7), this can be equivalently written as 


(3.11) 


x k+1 = x k ~ (A R .f(A R ..(A R ..) T )HA R .x k - b R ) 


Complexity. From Theorem 4.6 we obtain the following new complexity result: 


E [|| cn fc - x*\\l] < (1 - A min (E [(A*)^*^*)*)^]))* ||:r 0 - z*|||. 

To obtain a more meaningful convergence rate, we would need to bound the 
smallest eigenvalue of E [(A R .) T (A R .(A R .) T y A R .] . This has been done in [23, 24] 
when the image of R defines a row paving of A. Our framework paves the way for 
analysing the convergence of new block methods for a large set of possible random 
subsets R , including, for example, overlapping partitions. 

3.6. Randomized Newton: positive definite case. If A is symmetric posi¬ 
tive definite, then we can choose B = A and S = J : c, a column concatenation of the 
columns of I indexed by C, which is a random subset of [n]. In view of (2.2), this 
results in 

(3.12) x k+1 = f arg min ||a: — a; fc ||^ subject to ( A-c) T x = bc■ 

x£M. n 


In view of (2.7), we can equivalently write the method as 


(3.13) 


x k+1 


X k - Lc((I:c) T AI :C )- 1 (I:c) T (Ax k - b) 


Complexity. Clearly, iteration (3.13) is well defined as long as C is nonempty with 
probability 1. Such C is in [29] referred to by the name “non-vacuous” sampling. From 
Theorem 4.6 we obtain the following convergence rate: 


E[\\x k -x*\\\] <p k \\x°-x*r A 

(3.14) = (1 - A min (E [L c {{I:c) T AI :C )- 1 {I..c) T A])) k ||a: 0 - x*\\ 2 A . 


The convergence rate of this particular method was first established and studied 
in [29] . Moreover, it was shown in [29] that p < 1 if one additionally assumes that the 
probability that i € C is positive for each column i £ [n], i.e., that C is a “proper” 
sampling. 

Interpretation. Using formulation (2.3), and in view of the equivalence between 
f(x) and ||x — x*\\ 2 A discussed in Section 3.4, the Randomized Newton method can be 
equivalently written as 

x k+1 = arg min f(x) subject to x = x k + I-c y, y £ 

x£M. n 
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The next iterate is determined by advancing from the previous iterate over a subset 
of coordinates such that / is minimized. Hence, an exact line search is performed in 
a random \C\ dimensional subspace. 

Method (3.13) was fist studied by Qu et al [29], and referred therein as “Method 1”, 
or Randomized Newton Method. The name comes from the observation that the 
method inverts random principal submatrices of A and that in the special case when 
C = [n] with probability 1, it specializes to the Newton method (which in this case 
converges in a single step). The expression p defining the convergence rate of this 
method is rather involved and it is not immediately obvious what is gained by per¬ 
forming a search in a higher dimensional subspace (C > 1 ) rather than in the one¬ 
dimensional subspaces (C = 1), as is standard in the optimization literature. Let 
us write p = 1 — ay in the case when the C is chosen to be a subset of [n] of size 
r, uniformly at random. In view of (1-3), the method takes 0(l/cr T ) iterations to 
converge, where the tilde notation suppresses logarithmic terms. It was shown in [! 
that 1/oy < 1/(t(7i). That is, one can expect to obtain at least superlinear speedup in 
r - this is what is gained by moving to blocks / higher dimensional subspaces. For 
further details and additional properties of the method we refer the reader to [29] . 

3.7. Randomized Coordinate Descent: least-squares version. By choos¬ 
ing S = Ae l =: A : i as the ith column of A and B = A T A 1 the resulting iterates (2.3) 
are given by 

(3.15) x k+1 = arg min \\Ax — b\\\ subject to x = x k + ye l , y G R. 

xER™ 


When i is selected at random, this is the Randomized Coordinate Descent method 
( CD-LS ) applied to the least-squares problem: rriin x \\Ax — b\\\. Using (2.7), these 
iterations can be calculated with 


(3.16) 


x k+1 = x k 


(A.. i ) T (Ax k -b) 4 


Complexity. Applying Theorem 5.2, we see that by selecting i with probability 
proportional to magnitude of column i of A, that is = ||A.j|||/||A|||., results in a 
convergence with 

(3.17) E[\\x k -x*\\ 2 ATA ] < p k \\x° - x*\\\ TA = (l- 


(gd) V 


UWf 


\\x°-x*\\ 2 ata . 


This result was first established by Leventhal and Lewis [19]. 

Interpretation. Using the Constrain-and-Approximate formulation (2.3), the CD- 
LS method can be interpreted as 

(3.18) x k+1 = arg min ||a: — subject to x = x k + ye 1 , j/gR. 


The CD-LS method selects a coordinate to advance from the previous iterate x k , then 
performs an exact minimization of the least squares function over this line. This is 
equivalent to applying coordinate descent to the least squares problem rnin xeK n f(x) = f 
i[|Ar — b |||. The iterates (3.15) can be written as 

x k+1 =x k - ±-Vif (x k )e\ 


where Li = f ||Aj||! is the Lipschitz constant of the gradient corresponding to coordi¬ 
nate i and V;/(a; fc ) is the Ah partial derivative of / at x k . 
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4. Convergence: General Theory. We shall present two complexity theo¬ 
rems: we first study the convergence of ||E [x fc — x*] || , and then move on to analysing 
the convergence of E [||x fe — x*||]. 

4.1. Two types of convergence. The following lemma explains the relation¬ 
ship between the convergence of the norm of the expected error and the expected 
norm of the error. 

Lemma 4.1. Let x £ R n be a random vector, || • || a norm induced by an inner 
product and fix x* £ R" . Then 


E [x - x*] 


2 


E 



E 



Proof. Note that E [||x — E [x] || 2 ] = E [||x|| 2 ] — ||E [x] || 2 . Adding and subtracting 
||x* || 2 —2 (E [x], x*) from the right hand side and grouping the appropriate terms yields 
the desired result. □ 


To interpret this lemma, note that E 


x — E [x]|| 2 =Er=i E [(^-EN) 2 ] = 


E" = i Var(xj), where X; denotes the ith element of x. This lemma shows that the 
convergence of x to x* under the expected norm of the error is a stronger form of 
convergence than the convergence of the norm of the expected error, as the former 
also guarantees that the variance of Xj converges to zero, for i = 1 ,..., n. 


4.2. The Rate of Convergence. All of our convergence theorems (see Table 1) 
depend on the convergence rate 


(4.1) p d = 1 - A min (R- 1 E [Z]) = 1 - A min (B” 1 / 2 E [Z\ LT 1 / 2 ). 


To show that the rate is meaningful, in Lemma 4.2 we prove that 0 < p < 1. We also 
provide a meaningful lower bound for p. 

Lemma 4.2. The quantity p defined in (4.1) satisfies: 

(4.2) 0 < 1 - <p< 1, 

n 

where d = Rank (S T A). 

Proof. Since the mapping A i-a A max (A) is convex on the set of symmetric matri¬ 
ces, by Jensen’s inequality we get 


(4.3) A max (E [B~ 1 Z ]) = A max (R" 1 / 2 E [Z\ £T 1/2 ) < E [a max (Lr 1/2 ZLT 1/2 ) 


Recalling from Lemma 2.2 that B X Z is a projection, we get 

b- 1/2 zb- 1/2 {b- 1/2 zb~ 1/2 ) = B~ 1/2 ZB~ 1/2 , 

whence the spectrum of B~ 1 / 2 ZB~ 1 / 2 is contained in {0,1}. Thus, A max (iJ _1 ' /2 ZR _1 ' /2 ) < 1, 
and from (4.3) we conclude that A max (13 -1 E [Z]) < 1. The inequality A m i n (l?“ 1 E [Z\) > 0 
can be shown analogously using convexity of the mapping A > — A m i n (A). Thus 

A min (R- 1 E [Z]) = A min (R- 1 / 2 E [Z] B~^ 2 ) £ [0,1] 


and consequentially 0 < p < 1. As the trace of a matrix is equal to the sum of its 
eigenvalues, we have 

E [Tr (B~ 1 Z)] = Tr (E [B~ x Z\ ) > n A min (E [B^Z]). 


(4.4) 
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As B 1 Z projects onto a d-dimensional subspace (Lemma 2.2) we have Tr ( B l Z ) = 
d. Thus rewriting (4.4) gives 1 — E [d\ /n < p. □ 

The lower bound on p in item 1 has a natural interpretation which makes intuitive 
sense. We shall present it from the perspective of the Constrain-and-Approximate 
formulation (2.3). As the dimension (d) of the search space B~ 1 A T S increases (see 
(2.10)), the lower bound on p decreases, and a faster convergence is possible. For 
instance, when S is restricted to being a random column vector, as it is in the RK (3.3), 
CD-LS (3.16) and CD-pd (3.8) methods, the convergence rate is bounded with 1 — 
1/n < p. Using (1.3), this translates into the simple iteration complexity bound of 
k > nlog(l/e). On the other extreme, when the search space is large, then the lower 
bound is close to zero, allowing room for the method to be faster. 

We now characterize circumstances under which p is strictly smaller than one. 

Lemma 4.3. If E [Z] is invertible, then p < 1, A has full column rank and x* is 
unique. 

Proof. Assume that E [ Z\ is invertible. First, this means that F? -1 / 2 E [Z] R _1//2 
is positive definite, which in view of (4.1) means that p < 1. If A did not have full 
column rank, then there would be 0 ^ x £ R n such that Ax = 0. However, we 
then have Zx = 0 and also E [Z]x = 0, contradicting the assumption that E [Z] is 
invertible. Finally, since A has full column rank, x* must be unique (recall that we 
assume throughout the paper that the system Ax = b is consistent). □ 

4.3. Exact Characterization and Norm of Expectation. We now state a 
theorem which exactly characterizes the evolution of the expected iterates through 
a linear fixed point iteration. As a consequence, we obtain a convergence result for 
the norm of the expected error. While we do not highlight this in the text, this 
theorem can be applied to all the particular instances of our general method we detail 
throughout this paper. 

For any M £ R" x " let us define 

(4.5) \\M\\ B d = max \\Mx\\ B . 

IWIb=i 

Theorem 4.4 (Norm of expectation). For every x* £ R" satisfying Ax = b we 
have 

(4.6) E [x k+1 - a:*] = (/ - B” 1 E [Z]) E [x k - **] . 

Moreover, the spectral radius and the induced B-norm of the iteration matrix I — 
B~ l E[Z] are both equal to p: 

A max (/ - B- 1 E [Z]) = ||/ - -B -1 E [Z] || B = p. 


Therefore, 

(4.7) ||E [x k — x*] \\ B < P k \\x° - x*\\ B . 


Proof. Taking expectations conditioned on x k in (2.8), we get 


(4.8) 


E [x k+1 



(I - B~ 1 F,[Z])(x k - x*). 


— x 
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Taking expectation again gives 

E [x k+1 - ®*] = E [E [x- fe+1 - a;* | x k ) ] 

( = 8) E [(/ - B ~ l E [Z])(x k - a*)] 

= (I-B~ 1 F,[Z})F,[x k -x*]. 

Applying the norms to both sides we obtain the estimate 

||E [a; fe+1 - **] || B < ||7 - B~ l E [Z] || B ||E [x k - a*] || B . 

It remains to prove that p = ||7 — B ~ l E [Z\ || B and then unroll the recurrence. Ac¬ 
cording to the definition of operator norm (4.5), we have 

||/-B- 1 E[Z]|H= max \\BV\l - B~ X E [Z])x\\ 2 . 

|| B 1 / 2 x || 2 =1 

Substituting B x ! 2 x = y in the above gives 

|| J - B- 1 E [Z] ||| = max \\B^ 2 {I - B~ l E [Z])B-'/ 2 y\\ 2 
I|W||2 = 1 

= max ||(/-B- 1 / 2 E[ 2 ’]i?- 1 / 2) 2/ ||2 
lly||2=i 

= \ 2 ia ^(I-B- 1 ' 2 E[Z\B-V 2 ) 

= (l - A min (5-V 2 E [Z] B- 1 ' 2 )) 2 = p 2 , 

where in the third equality we used the symmetry of (I — B~ 1 E \Z] B _1 ) when passing 
from the operator norm to the spectral radius. Note that the symmetry of E [Z] 
derives from the symmetry of Z. □ 

4.4. Expectation of Norm. We now turn to analysing the convergence of the 
expected norm of the error, for which we need the following technical lemma. 
Lemma 4.5. IfE[Z] is positive definite, then 

(4-9) {E[Z}y,y)>{l-p)\\y\\ 2 B , My G K". 


Proof. As E [Z] and B are positive definite, we get 

1 — p = A min (.B -1/2 E [Z\ B~ x t 2 ) = max jf | B~ 1/2 E [Z] B~ 1/2 - t ■ I A o} 

= max {t | E[Z] — t ■ B >z 0} . 

Therefore, E [Z] A (1 — p)B , and the result follows.□ 

Theorem 4.6 (Expectation of norm). If E [Z\ is positive definite, then 

(4.10) E[\\x k -x*\\l] <p k \\x°-x*\\ 2 B , 

where p < 1 is given in (4.1). 

Proof. Let r k = x k — x*. Taking the expectation of (2.8) conditioned on r k we 
get 

E[||r fc+ 1 ||||r fc ] ( = 8) E [||(/ — B~ x Z)r k \\ 2 B | r fc ] 

(2 = 1} E [((B- Z)r k ,r k ) \ r k ] 

(Lemma (4.5)) 

= ||r fe |||-(E[Z]r fc ,r fc ) < p-\\r k \\ 2 B . 
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Taking expectation again and unrolling the recurrence gives the result. □ 

The convergence rate p of the expected norm of the error is “worse” than the p 2 
rate of convergence of the norm of the expected error in Theorem 4.4. This should 
not be misconstrued as Theorem 4.4 offering a “better” convergence rate than The¬ 
orem 4.6, because, as explained in Lemma 4.1, convergence of the expected norm of 
the error is a stronger type of convergence. More importantly, the exponent is not of 
any crucial importance; clearly, an exponent of 2 manifests itself only in halving the 
number of iterations. 

5. Methods Based on Discrete Sampling. When S has a discrete distribu¬ 
tion, we can establish under reasonable assumptions when E [Z\ is positive definite 
(Proposition 5.1), we can optimize the convergence rate in terms of the chosen prob¬ 
ability distribution, and finally, determine a probability distribution for which the 
convergence rate is expressed in terms of the scaled condition number (Theorem 5.2). 

Assumption 5.1 (Complete Discrete Sampling). The random matrix S has a 
discrete distribution. In particular, S = Si £ R mX9i with probability pt > 0, where 
SjA has full row rank and qt £ N, for i = 1,..., r. Furthermore S d = [Si,..., S r ] £ 
K mX ^fc 1 qi is such that A 1 S has full row rank. 

As an example of complete discrete sampling, if A has full column rank and each 
row of A is not strictly zero, S = e l with probability Pi = 1/n, for i = 1,..., n, then 
S = I then S is a complete discrete sampling. In fact, from any basis of M" we can 
construct a complete discrete sampling in an analogous way. 

When S is a complete discrete sampling, then S T A has full row rank and (S T AB~ 1 A T S)^ 
(S T AB~ X A T S) _1 . Therefore we replace the pseudo-inverse in (2.7) and (2.8) by the 
inverse. Furthermore, using a complete discrete sampling guarantees convergence of 
the resulting method. 

Proposition 5.1. Let S be a complete discrete sampling, then E [Z] is positive 
definite. 

Proof. Let 

(5.1) D = diag (^pT(( S 1 ) T AB~ 1 A T S 1 )" 1 / 2 ,..., V^((S r ) T AB~ l A T S,)" 1 / 2 ) 

which is a block diagonal matrix, and is well defined and invertible as Sj A has full 
row rank for i = 1,..., r. Taking the expectation of Z (1.2) gives 

E [Z\ = Y j A T S l {SfAB- 1 A T S i )- 1 SjAp l 

i= 1 

= A T (^S i ^ i {SjAB- 1 A T S. i )- 1 / 2 {SjAB- 1 A T S i y 1 ' 2 y /p i Sf^ A 

(5.2) = (A t SD) (DS t A) , 

which is positive definite because A T S has full row rank and D is invertible. □ 

With E [Z] positive definite, we can apply the convergence Theorem 4.4 and 4.6, and 
the resulting method converges. 

5.1. Optimal Probabilities. We can choose the discrete probability distri¬ 
bution that optimizes the convergence rate. For this, according to Theorems 4.6 
and 4.4 we need to find p = (pi,... ,p r ) that maximizes the minimal eigenvalue of 
B -1 / 2 E [Z] B~ x ! 2 . Let S' be a complete discrete sampling and fix the sample matrices 
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Si,..., S r . Let us denote Z = Z(p ) as a function of p = (pi ,... ,p r ). Then we can 
also think of the spectral radius as a function of p where 


p(p) = 1 - A mi „(fr 1/2 E [Z(p)} B ~ 1/2 ). 


Letting 


def 


A r = <P= (pi,...,p r ) £ 


^2 Pi = 1, P > o 


the problem of minimizing the spectral radius (i.e., optimizing the convergence rate) 
can be written as 

p* = f min p{jp) = 1- max\ min (B~ 1/2 E[Z(p)\B~ 1/2 ). 

p€ A r pe A r 

This can be cast as a convex optimization problem, by first re-writing 

B~ 1/2 -E[Z(p)]B- 1/2 =^2pi (B- l/2 A T Si{SjAB- 1 A T Si)- 1 SjAB~ 1/2 ^j 

i=1 

2=1 


where Vi = B 1 / 2 A T Si. Thus 


(5.3) p* = 1 - max A min ( ^PiVijV^Vi) 1 V- r 

P GA r ^ 

To obtain p that maximizes the smallest eigenvalue, we solve 

max t 

P ,t 

r 

(5.4) subject to Yl Pi A t ■ /, 

i= 1 

p £ A r . 

Despite (5.4) being a convex semi-definite program 4 , which is apparently a harder 
problem than solving the original linear system, investing the time into solving (5.4) 
using a solver for convex conic programming such as cvx [ ] can pay off, as we show 

in Section 7.4. Though for a practical method based on this, we would need to develop 
an approximate solution to (5.4) which can be efficiently calculated. 

5.2. Convenient Probabilities. Next we develop a choice of probability dis¬ 
tribution that yields a convergence rate that is easy to interpret. This result is new 
and covers a wide range of methods, including randomized Kaczmarz, randomized 
coordinate descent, as well as their block variants. However, it is more general, and 

4 When preparing a revision of this paper, we have learned about the existence of prior work [6 
where the authors have also characterized the probability distribution that optimizes the convergences 
rate of the RK method as the solution to an SDP. 
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covers many other possible particular algorithms, which arise by choosing a particular 
set of sample matrices S',, for i = 1 ,..., r. 

Theorem 5.2. Let S be a complete discrete sampling such that S = Si £ R m 
with probability 


(5.5) 


Tr {SjAB~ 1 A T S i ) 
\\B-V*ATS\\l 


Then the iterates (2.7) satisfy 

(5.6) E[\\x k -x*\\ 2 B ] <p k c \\x°-x*\\ 2 b , 

where 

, , A min ( S T AB~ 1 A T S) 

(5J) Pc ~ 1 115 - 1 / 2^8111 • 


Proof. Let ti = Tr (SjAB 1 A T S i ), and with (5.5) in (5.1) we have 
D2 = llB -J A T Sll 2 F d ^ HS^AB~ 1 A t S 1 )~\ ..., t r (S^AB- 1 A T S r )~ 1 ), 

thus 

(5.8) X min (D-)= llB -i /2A T Sll 2 F ™™{ Xni ^ { sTAB-iAT Si )} ~ \\B~^ATS\\ 2 f 
Applying the above in (5.2) gives 

Amin (B -1/2 E [Z] B" 1 / 2 ) = Amin {b~ 1 ' 2 A T SD 2 S T AS" 1 / 2 ) 

= Amin ( S T AB~ 1 A T SD 2 ) 

(5.9) > Amin (S T AB- 1 A T S) A min (iA 2 ) 

^ Amin (S T Ai?- 1 A T S) 

11 ^- 1 / 2^8112 > 


where we used that if B, C £ M raxn are positive definite Amin(-BC') > A m m(-B)A m in(C0- 
Finally 


/ I/, r , 1/9 \ X min (S T AB- 1 A T S) 

(5.10) 1 - V,„ (b-‘ / 2 E [Z] S-‘/ 2 ) < 1 - ' i /MTsp 


The result (5.6) follows by applying Theorem 4.6. □ 

The convergence rate A m i n (S T AS _ 1 A T S) /||.B _ 1 / 2 A T S|||, is known as the scaled 
condition number, and naturally appears in other numerical schemes, such as matrix 
inversion [10, 8]. When Si = Si £ R n is a column vector then 

Pi = ((sff AB- 1 A T sf) /\\B~ 1/2 A t S\\ 2 f , 


for i = 1,... r. In this case, the bound (5.8) is an equality and D 2 is a scaled identity, 
so (5.9) and consequently (5.10) are equalities. For block methods, it is different story, 
and there is much more slack in the inequality (5.10). So much so, the convergence 
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rate (5.7) does not indicate any advantage of using a block method (contrary to 
numerical experiments). To see the advantage of a block method, we need to use 
the exact expression for A m i n (D 2 ) given in (5.8). Though this results in a somewhat 
harder to interpret convergence rate, a matrix paving could be used explore this block 
convergence rate, as was done for the block Kaczmarz method [24, 23]. 

By appropriately choosing B and S, this theorem applied to RK method (3.2), 
the CD-LS method (3.15) and the CD-pd method (3.6), yields the convergence re¬ 
sults (3.4), (3.17) and (3.8), respectively, for single column sampling or block methods 
alike. 

This theorem also suggests a preconditioning strategy, in that, a faster conver¬ 
gence rate will be attained if S is an approximate inverse of B^ 1 ! 2 . For instance, 

in the RK method where B = /, this suggests that an accelerated convergence can 
be attained if S' is a random sampling of the rows of a preconditioner (approximate 
inverse) of A. 

6. Methods Based on Gaussian Sampling. In this section we shall describe 
variants of our method in the case when S is a Gaussian vector with mean 0 £ R m 
and a positive definite covariance matrix E £ R mxm . That is, S = £ ~ N( 0, E). This 
applied to (2.7) results in iterations of the form 


( 6 . 1 ) 


x k+1 = x k 


( T {Ax k - b) 
C T AB~ 1 A T C 


b~ 1 a t c 


Due to the symmetry of the multivariate normal distribution, there is a zero proba¬ 
bility that £ £ Null ( A T ) for any nonzero matrix A. 

Unlike the discrete methods in Section 3, to calculate an iteration of (6.1) we 
need to compute the product of a matrix with a dense vector £. This significantly 
raises the cost of an iteration. Though in our numeric tests in Section 7, the faster 
convergence of the Gaussian method often pays off for their high iteration cost. 

To analyze the complexity of the resulting method let £ = f B~ 1 / 2 A T S , which is 
also Gaussian, distributed as £ ~ N(0,CI), where f 1 = f B~ 1 ^ 2 A t T,AB~ 1 ^ 2 . In this 
section we assume A has full column rank , so that f l is always positive definite. The 
complexity of the method can be established through 


( 6 . 2 ) 


p=l — A min (E [B-^ZB- 1 / 2 ^ 


— 1 Amin 



We can simplify the above by using the lower bound 


E 



^2 n 

~ n Tr (0) ’ 


which is proven in Lemma A.l in the Appendix. Thus 


(6.3) 


1 1_■ , 2 A min (Q) 

n ~ P ~ it Tr (U) 


where we used the general lower bound in (4.2). Lemma A.l also shows that E [££ T /||£|| 2 ] 
is positive definite, thus Theorem 4.6 guarantees that the expected norm of the error 
of all Gaussian methods converges exponentially to zero. This bound is tight upto 














19 


a constant factor. For illustration of this, in the setting with A = I = E we have 
£ ~ N(0,I) and E [££ T /||£|||] = £/, which yields 

1 2 1 

1 -< p< 1 -. 

n 7 t n 

When n = 2, then in Lemma B.l of the Appendix we prove that 

ren = » i/2 

mil Tr (SlV*y 

which yields a very favourable convergence rate. 

6.1. Gaussian Kaczmarz. Let B = I and choose E = I so that S = y ~ 
N(0,I). Then (6.1) has the form 


(6.4) 


k +1 _ k V T (Ax k -b) T 


X = x — 


\\A T v\\i 


A y 


which we call the Gaussian Kaczmarz (GK) method, for it is the analogous method 
to the Randomized Karcmarz method in the discrete setting. Using the formula¬ 
tion (2.3), for instance, the GK method can be interpreted as 

x k+1 = arg min ||a; — x*\\ 2 subject to x = x k + A T yy, y G R. 

zeR’* 


Thus at each iteration, a random normal Gaussian vector 77 is drawn and a search 
direction is formed by A Tr q. Then, starting from the previous iterate x k , an exact line 
search is performed over this search direction so that the euclidean distance from the 
optimal is minimized. 

6.2. Gaussian Least-Squares. Let B = A T A and choose S ~ N( 0, E) with 
E = AA T . It will be convenient to write S = Ay, where y ~ N(0,I). Then method 
( 6 . 1 ) then has the form 


(6.5) 


x k+1 = x k - 


y T A T (Ax k — b) 

WMWl 


which we call the Gauss-LS method. This method has a natural interpretation 
through formulation (2.3) as 


k +1 


= arg min -\\Ax - b\\\ 
x£K n Z 


subject to x = x k + yy, y G R. 


That is, starting from x k , we take a step in a random (Gaussian) direction, then 
perform an exact line search over this direction that minimizes the least squares error. 
Thus the Gauss-LS method is the same as applying the Random Pursuit method [3 
with exact line search to the Least-squares function. 

6.3. Gaussian Positive Definite. When A is positive definite, we achieve an 
accelerated Gaussian method. Let B = A and choose S = y ~ N(Q,I). Method (6.1) 
then has the form 


x k+1 = x k 



( 6 . 6 ) 
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which we call the Gauss-pd method. 

Using formulation (2.3), the method can be interpreted as 

x k+1 = arg min j/( x) = f \x T Ax — b T x j subject to x = x k + yr], ygR. 


That is, starting from x k , we take a step in a random (Gaussian) direction, then per¬ 
form an exact line search over this direction. Thus the Gauss-pd method is equivalent 
to applying the Random Pursuit method [36] with exact line search to f(x). 

All the Gaussian methods can be extended to block versions. We illustrate this 
by designing a Block Gauss-pd method where S £ K TIXl? has i.i.d. Gaussian normal 
entries and B = A. This results in the iterates 


(6.7) 


x k+1 = x k - S(S T AS)~ 1 S T (Ax k - b). 


7. Numerical Experiments. We perform some preliminary numeric tests. Ev¬ 
erything was coded and run in MATLAB R2014b. Let k 2 = ||A||||A'I'|| be the 2—norm 
condition number, where A.1 is a pseudo-inverse of A. In comparing different methods 
for solving overdetermined systems, we use the relative error measure || Ax k — &||2/1|&||2 5 
while for positive definite systems we use \\x k — x*|| J 4/||x*|| J 4 as a relative error mea¬ 
sure. We run each method until the relative error is below 10~ 4 or until 300 seconds 
in time is exceeded. We use Xq = 0 £ R" as an initial point. In each figure we 
plot the relative error in percentage on the vertical axis, starting with 100%. For the 
horizontal axis, we use either wall-clock time measured using the tic-toc MATLAB 
function or the total number of floating point operations (flops). 

In implementing the discrete sampling methods we used the convenient probabil¬ 
ity distributions (5.5). 

All tests were performed on a Desktop with 64bit quad-core Intel(R) Core(TM) 
i5-2400S CPU @2.50GHz with 6MB cache size with a Scientific Linux release 6.4 
(Carbon) operating system. 

Consistently across our experiments, the Gaussian methods almost always require 
more flops to reach a solution with the same precision as their discrete sampling 
counterparts. This is due to the expensive matrix-vector product required by the 
Gaussian methods. While the results are more mixed when measured in terms of 
wall clock time. This is because MATLAB performs automatic multi-threading when 
calculating matrix-vector products, which was the bottleneck cost in the Gaussian 
methods. As our machine has four cores, this explains some of the difference observed 
when measuring performance in terms of number of flops and wall clock time. 

7.1. Overdetermined linear systems. First we compare the methods Gauss- 
LS, CD-LS, Gauss-Kaczmarz and RK methods on synthetic linear systems generated 
with the matrix functions rand and sprandn, see Figure 7.1. The high iteration 
cost of the Gaussian methods resulted in poor performance on the dense problem 
generated using rand in Figure 7.1a. In Figure 7.1b we compare the methods on a 
sparse linear system generated using the MATLAB sparse random matrix function 
sprandn(m, n, density ,rc), where density is the percentage of nonzero entries and 
rc is the reciprocal of the condition number. On this sparse problem the Gaussian 
methods are more efficient, and converge at a similar rate to the discrete sampling 
methods. 

In Figure 7.2 we test two overdetermined linear systems taken from the the Matrix 
Market collection [3]. The collection also provides the right-hand side of the linear 
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(a) rand 



(b) sprandn 


Fig. 7.1: The performance of the Gauss-LS, CD-LS, Gauss-Kaczmarz and RK methods 
on synthetic MATLAB generated problems (a) rand (n,m) with (m;n) = (1000,500) 
(b) sprandn(m, n, density,rc) with (m;n) = (1000,500), density= l/log(nm) and 
rc= 1/y/mn. In both experiments dense solutions were generated with a;* =rand(n, 1) 
and b = Ax*. 


system. Both of these systems are very well conditioned, but do not have full column 
rank, thus Theorem 4.6 does not apply. The four methods have a similar performance 
on Figure 7.2a, while the Gauss-LS and CD-LS method converge faster on 7.2b as 
compared to the Gauss-Kaczmarz and Kaczmarz methods. 

Finally, we test two problems, the SUSY problem and the covtype.binary prob¬ 
lem, from the library of support vector machine problems LIBSVM [5] . These problems 
do not form consistent linear systems, thus only the Gauss-LS and CD-LS methods 
are applicable, see Figure 7.3. This is equivalent to applying the Gauss-pd and CD-pd 
to the least squares system A T Ax = A T b : which is always consistent. 

Despite the higher iteration cost of the Gaussian methods, their performance, in 
terms of the wall-clock time, is comparable to performance of the discrete methods 
when the system matrix is sparse. 

7.2. Bound for Gaussian convergence. Now we compare the error over the 
number iterations of the Gauss-LS method to theoretical rate of convergence given 
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(a) illcl033 
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H 
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(b) we!11033 


Fig. 7.2: The performance of the Gauss-LS, CD-LS, Gauss-Kaczmarz and RK methods 
on linear systems (a) welll033 where (m; n) = (1850, 750), nnz = 8758 and K 2 = 1.8 
(b) illcl033 where (m;n) = (1033; 320), nnz = 4732 and k 2 = 2.1, from the Matrix 
Market [3]. 


by the bound (6.3). For the Gauss-LS method (6.3) becomes 

„ 1 2 , (A T A\ 

1 — ^ ^ ^min ( || a \\2 ) ' 

n 7T \\\A\\ 2 f J 

In Figures 7.4a and 7.4b we compare the empirical and theoretical bound on a random 
Gaussian matrix and the liver-disorders problem [ ]. Furthermore, we ran the 
Gauss-LS method 100 times and plot as dashed lines the 95% and 5% quantiles. 
These tests indicate that the bound it tight for well conditioned problems, such as 
Figure 7.4a in which the system matrix has a condition number equal to 1.94. While 
in Figure 7.4b the system matrix has a condition number of 41.70 and there is some 
much more slack between the empirical convergence and the theoretical bound. 

7.3. Positive Definite. First we compare the two methods Gauss-pd (6.6) and 
CD-pd (3.7) on synthetic data in Figure 7.5. Using the MATLAB function hilbert, 
we can generate positive definite matrices with very high condition number, see Fig¬ 
ure 7.5(LEFT). Both methods converge slowly and, despite the dense system matrix, 
the Gauss-pd method has a similar performance to CD-pd. In Figure (7.5) (RIGHT) 
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time (s) 



(a) SUSY 



(b) covtype-libsvm-binary 


Fig. 7.3: The performance of Gauss-LS and CD-LS methods on two LIBSVM 
test problems: (a) SUSY: (m;n) = (5 x 10 6 ; 18) (b) covtype.binary: (m;n) = 
(581,012; 54). 


we compare the two methods on a system generated by the MATLAB function 
sprandsym (to, n, density, rc, type), where density is the percentage of nonzero 
entries, rc is the reciprocal of the condition number and type=l returns a positive 
definite matrix. The Gauss-pd and the CD-pd method have a similar performance in 
terms of wall clock time on this sparse problem. 

To appraise the performance gain in using block variants, we perform tests using 
two block variants: the Randomized Newton method (3.12), which we will now refer 
to as the Block CD-pd method, and the Block Gauss-pd method (6.7). The size of 
blocks q in both methods was set to q = \Jn.. To solve the q x q system required in 
the block methods, we use MATLAB’s built-in direct solver, sometimes referred to as 
“back-slash”. 

Next we test the Newton system V 2 f(wo)x = —\7f(w o), arising from four ridge- 
regression problems of the form 

(7.1) min f(w) = f ||| Aw - b\\j + f IMli, 

w£K n 

using data from LIBSVM [5]. In particular, we set iijq = 0 and use A = 1 as the 
regularization parameter, whence V/(iuo) = A T b and V 2 /(wo) = A T A + I. 
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iterations 


(a) rand(n, m) 



(b) liver-disorders 


Fig. 7.4: A comparison between the Gauss-LS method and the theoretical bound 

Ptheo = f 1 — A m i n (A T A)/||A|||, on (a) rand(n,m) with (m;n) = (500, 50), = 1-94 

and a dense solution generated with x* = rand(n, 1 ) (b) liver-disorders with 
(m;n) = (345,6) and K 2 = 41.70. 




time (s) 


flops xio 7 



Fig. 7.5: Synthetic MATLAB generated problem. The Gaussian methods are more 
efficient on sparse matrices. LEFT: The Hilbert Matrix with n = 100 and condition 
number ||A||||A -1 || = 6.5953 x 10 19 . RIGHT: Sparse random matrix A = sprandsym 
(n, density, rc, type) with n = 1000, density= l/log(n 2 ) and rc = 1/n = 0.001. 
Dense solution generated with x* =rand(n, 1). 


In terms of wall clock time, The Gauss-pd method converged faster on all problems 
accept the protein problem as compared to CD-pd. The two Block methods had a 
comparable performance on the aloi and the SUSY problem. The Block Gauss-pd 
method converged in one iteration on covtype .binary, and the Block CD-pd method 
converged fast on the Protein problem. 

We now compare the methods on two positive definite matrices from the Ma¬ 
trix Market collection [3], see Figure 7.7. The right-hand side was generated using 
rand(n,l). The Block CD-pd method converged much faster on both problems. 
The lower condition number (k 2 = 12) of the gr_30_30-rsa problem resulted in 
fast convergence of all methods, see Figure 7.7a. While the high condition num¬ 
ber (k 2 = 4.3 ■ 10 4 ) of the bcsstkl8 problem, resulted in a slow convergence for all 
methods, see Figure 7.7b. 

Despite the clear advantage of using a block variant, applying a block method 
that uses a direct solver can be infeasible on very ill-conditioned problems. As an 
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time (s) 
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(a) aloi 


(b) protein 
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(c) SUSY 


(d) covtype.binary 


Fig. 7.6: The performance of Gaussian and Coordinate Descent pd methods on 
four ridge regression problems: (a) aloi: (m;n) = (108,000; 128) (b) protein: 
( m;n ) = (17,766; 357) (c) SUSY: (m;n) = (5 x 10 6 ; 18) (d) covtype .binary: 
(to; n) = (581,012; 54). 


example, applying the Block CD-pd to the Hilbert system, and using MATLAB back¬ 
slash solver to solve the inner q x q systems, resulted in large numerical inaccuracies, 
and ultimately, prevented the method from converging. This occurred because the 
submatrices of the Hilbert matrix are also very ill-conditioned. 

7.4. Comparison between Optimized and Convenient probabilities. We 

compare the practical performance of using the convenient probabilities (5.5) against 
using the optimized probabilities by solving (5.4). We solved (5.4) using the disci¬ 
plined convex programming solver cvx [ ] for MATLAB. 

In Table 2 we compare the different convergence rates for the CD-pd method, 
where p c is the convenient convergence rate (5.7), p* the optimized convergence rate, 
(1 — 1/n) is the lower bound, and in the final “optimized time(s)” column the time 
taken to compute p*. In Figure 7.8, we compare the empirical convergence of the 
CD-pd method when using the convenient probabilities (5.5) and CD-pd-opt, the 
CD-pd method with the optimized probabilities, on four ridge regression problems 
and a uniform random matrix. We ran each method for 60 seconds. 

In most cases using the optimized probabilities results in a much faster conver¬ 
gence, see Figures 7.8a, 7.8c, 7.8d and 7.8e. In particular, the 7.401 seconds spent 
calculating the optimal probabilities for aloi paid off with a convergence that was 
55 seconds faster. The mushrooms problem was insensitive to the choice of proba¬ 
bilities 7.8d. Finally despite p* being much less than p c on covtype, see Table 2, 
using optimized probabilities resulted in an initially slower method, though CD-pd- 
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(a) gr_30_30-rsa 


(b) bcsstkl8 


Fig. 7.7: The performance of the Gauss-pd, CD-pd and the Block CD-pd methods 
on two linear systems from the MatrixMarket (a) gr_30_30-rsa with n = 900, nnz = 
4322 (density= 0.53%) and «2 = 12. (b) bcsstkl8 with n = 11948, nnz = 80519 
(density= 0.1%) and «2 = 4.3 • 10 10 . 


data set 

Pc p* 1 - 1/n 

optimized time(s) 

rand(50,50) 

1 - 2 ■ 10 -H 1 - 3.05 ■ lO" 6 1 - 2.10- 2 

1.076 

mushrooms-ridge 

1 — 5.86 ■ 10 -6 1 — 7.15 ■ 10~ 6 1 — 8.93 ■ 10~ 3 

4.632 

aloi-ridge 

1 — 2.17 ■ 10 -7 1 — 1.26 ■ 10~ 4 1 — 7.81 ■ 10~ 3 

7.401 

liver-disorders-ridge 

1 — 5.16 ■ 10 -4 1 — 8.25 ■ 10 -3 1 — 1.67 ■ 10 -1 

0.413 

covtype.binary-ridge 

1 - 7.57- lCT 14 1 — 1.48 ■ 10 -6 1 — 1.85 ■ 10 -2 

1.449 


Table 2: Optimizing the convergence rate for CD-pd. 


opt eventually catches up as CD-pd stagnates, see Figure 7.8b. 

In Table 3 we compare the different convergence rates for the RK method. In 
Figure 7.9, we then compare the empirical convergence of the RK method when using 
the convenient probabilities (5.5) and RK-opt, the RK method with the optimized 
probabilities by solving (5.4). The rates p* and p c for the rand(500,100) problem 
are similar, and accordingly, both the convenient and optimized variant converge at 
a similar rate in practice, see Figure 7.9b. While the difference in the rates p* and 
p c for the liver-disorders is more pronounced, and in this case, the 0.83 seconds 
invested in obtaining the optimized probability distribution paid off in practice, as the 
optimized method converged 1.25 seconds before the RK method with the convenient 
probability distribution, see Figure 7.9a. 

We conclude from these tests that the choice of the probability distribution can 
greatly affect the performance of the method. Hence, it is worthwhile to develop 
approximate solutions to (5.3). 

8. Conclusion. We present a unifying framework for the randomized Kaczmarz 
method, randomized Newton method, randomized coordinate descent method and 
random Gaussian pursuit. Not only can we recover these methods by selecting ap¬ 
propriately the parameters S and B , but also, we can analyse them and their block 
variants through a single Theorem 4.6. Furthermore, we obtain a new lower bound 
for all these methods in Theorem 4.4, and in the discrete case, recover all known 
convergence rates expressed in terms of the scaled condition number in Theorem 5.2. 

The Theorem 5.2 also suggests a preconditioning strategy. Developing precondi¬ 
tioning methods are important for reaching a higher precision solution on ill-conditioned 
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data set 

Pc p* 1 - 1 /ri 

optimized time(s) 

rand(500,100) 

1 -3.37- 10 tl 1 - 4.27- 10- a 1 - 1 ■ 1CT 2 

33.121 

liver-disorders 

1 — 5.16 ■ 10 -4 1 — 4.04 • 10 -3 1 — 1.67 • 10” 1 

0.8316 


Table 3: Optimizing the convergence rate for randomized Kaczmarz. 


problems. For as we have seen in the numerical experiments, the randomized meth¬ 
ods struggle to bring the solution within ICO 2 relative error when the matrix is ill- 
conditioned. 

This is also a framework on which randomized methods for linear systems can 
be designed. As an example, we have designed a new block variant of RK, a new 
Gaussian Kaczmarz method and a new Gaussian block method for positive definite 
systems. Furthermore, the flexibility of our framework and the general convergence 
Theorems 4.6 and 4.4 allows one to tailor the probability distribution of S' to a par¬ 
ticular problem class. For instance, other continuous distributions such uniform, or 
other discrete distributions such Poisson might be more suited to a particular class of 
problems. 

Numeric tests reveal that the new Gaussian methods designed for overdetermined 
systems are competitive on sparse problems, as compared to the Kaczmarz and CD- 
LS methods. The Gauss-pd also proved competitive as compared to CD-pd on all 
tests. Though, when applicable, the combined efficiency of using a direct solver and 
an iterative procedure, such as in Block CD-pd method, proved the most efficient. 

The work opens up many possible future venues of research. Including inves¬ 
tigating accelerated convergence rates through preconditioning strategies based on 
Theorem 5.2 or by obtaining approximate optimized probability distributions (5.4). 

Acknowledgments. The authors would like to thank Prof. Sandy Davie for use¬ 
ful discussions relating to Lemma B.l, and Prof. Joel Tropp for invaluable suggestions 
regarding Lemma A.l. 
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Appendix A. A Bound on the Expected Gaussian Projection Matrix. 


Lemma A.l. Let D £ R nx ™ he a positive definite diagonal matrix, U £ 
orthogonal matrix and Ll = UDU T . If u ~ N(0,D) and f ~ N(0,Q) then 


(A.l) 

and 

(A.2) 
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u T ej 


= UE 


uu 


u T , 


E 


“ 7r Tr (D)' 


2 n 
>~ -- 


Proof. Let us write 5(£) for the random vector £/||£|| 2 (if £ = 0, we set 5(£) = 0). 
Using this notation, we can write 


E [C(e T 0 _1 e T ] = E [5(0(5(£)) T ] = Cov [5(0], 
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where the last identity follows since E [£(£)] = 0, which in turn holds as the Gaussian 
distribution is centrally symmetric. As £ = Uu, note that 


S(u) 


u T t 

\\u T tih 


U ^ 

IlClb 


U T S(t). 


Left multiplying both sides by U we obtain US(u) = from which we obtain 

Cov [5(0] = t/Cov [S{u)} U T , 


which is equivalent to (A.l). 

To prove (A. 2), note first that M = f E [ uu T /u T u] is a diagonal matrix. One can 
verify this by direct calculation (informally, this holds because the entries of u are 
independent and centrally symmetric). The ith diagonal entry is given by 


Mu — E 


E n 

.7=1' 


As the map (x, y) —)• x 2 /y is convex on the positive orthant, we can apply Jensen’s 
inequality, which gives 


E 


T n u 2 
^.7=1 “7 
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(E[ 


2 D, 


EU E [«?] nTr (DY 


which concludes the proof. □ 

Appendix B. Expected Gaussian Projection Matrix in 2D. 

Lemma B.l. Let £ ~ N(0, fl) and £ K 2x2 be a positive definite matrix, then 


(B.l) 



n 1 / 2 

Tr (ft 1 / 2 )' 


Proof. Let E = UDU T and u ~ N(0,D). Given (A.l) it suffices to show that 

D 1 / 2 


(B.2) 


which we will now prove. 


Cov [S(u)] = 


Tr (D 1 / 2 ) ’ 


Let a 2 and cr“ be the two diagonal elements of D. First, suppose that a x = a y . 


Then u = a x y where 77 ~ N{ 0, 1) and 

n.Tl 
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Now suppose that u x Y a y We calculate the diagonal terms of the covariance matrix 
by integrating 


E 


1 r x 2 

u 2 + u\ J 2Tra x a y J R 2 x 2 + y 2 


-\(x 2 /ol+y 2 /crl) dxdy 


Using polar coordinates x = Rcos(0) and y = ZZsin(0) we have 

r 2 r 2ir poo 2 

(B.3) / -2 --e-K^/^+s l2/a l)dxdy= / / Rcos 2 (e)e~ S ^ c(9) dRde 
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where C(9) = f (cos(0) 2 /cr 2 + sin(0) 2 /<7 2 ) . Note that 


(B.4) 


Re 


C(fl)fl 2 


dR = — 


C(0)R 2 


C(9) 
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wr 


This applied in (B.3) gives 


E 


ul 


1 f 2 ” cos 2 (g) = b r cos 2 (9) 

2na x a y J 0 cos(0) 2 /cr 2 + sin((9) 2 / <j 2 n J 0 cos 2 (9) + b 2 sin 2 (6) 


where b = a x la y . Multiplying the numerator and denominator of the integrand by 
sec 4 (a;) gives the integral 


E 


l 2 J 


b r sec 2 (9) 

7 t Jo sec(0) 2 (l + 6 2 tan 2 (0)) 


Substituting v = tan(0) so that u 2 + 1 = sec 2 (9), dv = sec 2 (9)d9 and using the partial 
fractions 


(v 2 + 1) (1 + b 2 v 2 ) 1 — b 2 \v 2 + 1 b 2 v 2 + 1/ 


b 2 


gives the integral 


dv 
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(B.5) 


{v 2 + 1) (1 + b 2 v 2 ) 1 — b 2 

1 


1 - b 2 


(arctan(u) — fearctan(fw)) 
(■9 — 6arctan(6tan(0))). 


To apply the limits of integration, we must take care because of the singularity at 
9 = 7t/ 2. For this, consider the limits 


lim arctan(5tan(0)) = —, 
e-KTr/2)- V K 2 


lim arctan(6tan(0)) =-. 

0—Ktt/2)+ V K 2 


Using this to evaluate (B.5) on the limits of the interval [0, 7r/2] gives 

t 


lim -—r- (6 — 6arctan(5tan(0))) 

t->( 7r/2)- 1 — b 2 


1 7J- 7T 

1 - 6 2 2 1 J 2(1 + 6 ) ■ 


Applying a similar argument for calculating the limits from 7r/2 + to 7r, we find 


E 


2b it <j x 

u\ + u 2 71 2(1 + b) a y + a x 


Repeating the same steps with x swapped for y we obtain the other diagonal element, 
which concludes the proof of (B.2). □ 
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(b) covtype.libsvm.binary 



(c) liver-disorders-ridge 




(e) uniform-random-50X50-opt 


Fig. 7.8: The performance of CD-pd and optimized CD-pd methods on (a) 
aloi: ( m;n ) = (108,000; 128) (b) covtype .binary: (m;n) = (581, 012; 54) (c) 
liver-disorders: (m; n) = (345,6) (c)mushrooms: ( m;n ) = (8124,112) (d) 

uniform-random-50X50. 
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—O—■ Kaczmarz 
—I— Kaczmarz-popt 




Ok 


"X 


"X.S. 


"X 

NP-. 


10 


-io. 


0.1 0.2 
time (si 
(b) rand(500,100) 


0.3 


Fig. 7.9: The performance of Kaczmarz and optimized Kaczmarz methods on (a) 
liver-disorders: ( m\n ) = (345,6) (b) rand(500,100) 















